Test of geocoding

Problem Statement:

Between 2015 and our current dataset, 15676 geographies are missing. We are attempting to fill in these geographies using two different methods.

Technique 1:

If x and y values are present for both the preceding and the following year, and those values are within 0.0001 of each other, we use the average of the two values. (imputed)
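The rule above can be sketched in pandas. This is a minimal illustration, not the code used for the results: the column names `pin10`, `year`, `x`, and `y` are hypothetical, "preceding and following year" is read as exactly one year before and after, and the 0.0001 tolerance is applied to x and y separately.

```python
import pandas as pd

TOLERANCE = 0.0001  # threshold from the text; same units as x and y


def impute_from_neighbors(df: pd.DataFrame, tol: float = TOLERANCE) -> pd.DataFrame:
    """Fill missing x/y with the mean of the previous and next year's values.

    Assumes hypothetical columns: pin10, year, x, y.
    """
    out = df.sort_values(["pin10", "year"]).copy()
    grouped = out.groupby("pin10")

    # Year and coordinates of the adjacent rows within each PIN.
    prev_year, next_year = grouped["year"].shift(1), grouped["year"].shift(-1)
    prev_x, next_x = grouped["x"].shift(1), grouped["x"].shift(-1)
    prev_y, next_y = grouped["y"].shift(1), grouped["y"].shift(-1)

    # Treat a row as missing if either coordinate is absent.
    missing = out["x"].isna() | out["y"].isna()

    # Comparisons against NaN are False, so rows lacking a neighbor drop out.
    eligible = (
        missing
        & (prev_year == out["year"] - 1)
        & (next_year == out["year"] + 1)
        & ((prev_x - next_x).abs() <= tol)
        & ((prev_y - next_y).abs() <= tol)
    )

    out["imputed"] = eligible
    out.loc[eligible, "x"] = (prev_x + next_x)[eligible] / 2
    out.loc[eligible, "y"] = (prev_y + next_y)[eligible] / 2
    return out
```

Rows whose neighboring years disagree by more than the tolerance are left missing, so they remain candidates for geocoding.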

Technique 2:

We geocode the addresses using OpenStreetMap. (geocoding)
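The write-up does not say which client was used to query OpenStreetMap, so the sketch below only shows the surrounding bookkeeping: normalize each address, look it up once, and keep the result. The function name `geocode_addresses` and the `lookup` callable are hypothetical; `lookup` stands in for whatever service call is actually used and is assumed to return an (x, y) pair or `None`.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]  # (x, y), assumed to be (longitude, latitude)


def geocode_addresses(
    addresses: Iterable[str],
    lookup: Callable[[str], Optional[Coordinates]],
) -> Dict[str, Optional[Coordinates]]:
    """Geocode each distinct address once and return address -> coordinates."""
    results: Dict[str, Optional[Coordinates]] = {}
    for address in addresses:
        # Collapse repeated whitespace and standardize case before the lookup.
        cleaned = " ".join(address.split()).upper()
        if cleaned not in results:
            results[cleaned] = lookup(cleaned)  # None means "not found"
    return results
```

One possible `lookup` would wrap the `geocode` method of geopy's `Nominatim` geocoder, which returns an object with `latitude` and `longitude` attributes or `None`. That choice is an assumption here, not something the source confirms.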

Results:

The number of PIN10 geographies filled by imputing is 8482 and the number of geographies filled by geocoding is 6563.

Imputed PINs:

Geocoded PINs:

Map of all PINs with results from both techniques:

To test how these compare, we join the two sets of results back together and create a distance (ft) column. This is the distance between the imputed and geocoded locations for observations that have both. The result shows that the two techniques produce very similar locations, although there are some differences.
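A minimal sketch of this comparison follows. It assumes x is longitude and y is latitude in decimal degrees and uses the haversine great-circle formula; if the coordinates are in a projected system, a plain Euclidean distance would be the right choice instead. The column names and the join keys `pin10` and `year` are hypothetical.

```python
import numpy as np
import pandas as pd

# Mean Earth radius in metres, converted to feet (1 ft = 0.3048 m).
EARTH_RADIUS_FT = 6_371_008.8 / 0.3048


def haversine_ft(lon1, lat1, lon2, lat2):
    """Great-circle distance in feet between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_FT * np.arcsin(np.sqrt(a))


def compare_techniques(imputed: pd.DataFrame, geocoded: pd.DataFrame) -> pd.DataFrame:
    """Keep observations present in both tables and add a distance_ft column."""
    both = imputed.merge(
        geocoded, on=["pin10", "year"], suffixes=("_imputed", "_geocoded")
    )
    both["distance_ft"] = haversine_ft(
        both["x_imputed"], both["y_imputed"],
        both["x_geocoded"], both["y_geocoded"],
    )
    return both
```

The default inner join drops observations filled by only one technique, which matches the idea of comparing only the pairs where both results exist.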

PINs with Long Distances (> 1,000 ft):

There are 440 matched geographies more than 1,000 ft apart, out of 9648 matched geographies in total. Of those 9648, only 2476 are from 2015 or later.

The main (fixable) issue that I saw here was incorrect street directions (for example, 359 W Ohio is likely 359 E Ohio).

Data post 2015

Example of a large-parcel PIN

The biggest issue was PINs like this one, which are on very large parcels. I’m still doing some QC on them, but this is a fairly expected difference between the two techniques: geocoding seems to place the point directly on the street, while imputing places it at the center of the parcel.

Histogram of Distance (ft) Between Techniques

The vast majority of PINs are within 300 feet of each other. An example of this would be Address: 1346 W CULLERTON ST 3 CHICAGO IL 60608, CHICAGO, IL. The two locations are 185 feet, or three houses, apart. While not ideal, it definitely seems to be an improvement over no information. The selection of the 5,000 ft difference was largely arbitrary, but it shows that the data seems pretty consistent.
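The figures quoted in this section (the share of pairs within 300 ft, the count over 1,000 ft, and the histogram) can all be derived from a distance column. The sketch below is illustrative only: the function name is hypothetical, 5,000 ft is treated as the upper limit of the histogram (an assumed reading of the text), and the 100 ft bin width is arbitrary.

```python
import numpy as np


def summarize_distances(distance_ft, near_ft=300, far_ft=1000,
                        cap_ft=5000, bin_width_ft=100):
    """Summarize distances (ft) between the imputed and geocoded locations."""
    d = np.asarray(distance_ft, dtype=float)
    d = d[~np.isnan(d)]  # ignore pairs without a computed distance

    # Histogram of distances up to the cap, in fixed-width bins.
    edges = np.arange(0, cap_ft + bin_width_ft, bin_width_ft)
    counts, _ = np.histogram(d[d <= cap_ft], bins=edges)

    return {
        "n": int(d.size),
        "share_within_near": float((d <= near_ft).mean()) if d.size else float("nan"),
        "n_over_far": int((d > far_ft).sum()),
        "hist_counts": counts,
        "hist_edges": edges,
    }
```

Passing the distance column from the joined table to `summarize_distances` would give the counts needed to check the statements above against the data.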

What this does not answer is whether PINs that were filled by only one technique are harder to code (for example, because they have more unusual addresses). There is no intuitive reason why this would be the case, but it is something we should think about.

Conclusion:

In general, the two techniques produce similar results. I’d have to go through them a bit more in depth, but at the moment, the geocoding looks a bit better, specifically because it fills more observations. On the other hand, the imputed values are more likely to be correct, as we know we have some issues with our address data.